feat(xqwatcher): migrate from EC2 ASG to Kubernetes Deployment by blarghmatey · Pull Request #4287 · mitodl/ol-infrastructure

blarghmatey · 2026-03-11T16:23:52Z

Summary

Migrates xqueue-watcher infrastructure from EC2 Auto Scaling Groups with AppArmor/codejail sandboxing to a Kubernetes Deployment using container-based grading. This is the infrastructure companion to mitodl/xqueue-watcher#14, which implements the ContainerGrader backend.

Changes

`src/ol_infrastructure/lib/ol_types.py`

Added xqwatcher to both Services and Application enums for consistent K8s label generation.

`src/ol_infrastructure/applications/xqwatcher/xqwatcher_server_policy.hcl`

Added read access to secret-DEPLOYMENT/edx-xqueue so the grader handler config (stored in Vault) can embed the xqueue server URL and authentication password.

`src/ol_infrastructure/applications/xqwatcher/__main__.py`

Complete rewrite replacing EC2 resources with Kubernetes resources:

Old (EC2)	New (K8s)
IAM instance profile + Vault AWS auth	`OLEKSAuthBinding` (IRSA + Vault K8s auth)
EC2 Launch Template + ASG	Kubernetes `Deployment`
AMI with codejail/AppArmor	`mitodl/xqueue-watcher` (DockerHub) container image
Consul config distribution	`ConfigMap` + `OLVaultK8SSecret` CRD

New Kubernetes resources created:

OLEKSAuthBinding — IRSA role + Vault Kubernetes auth backend role
OLVaultK8SSecret — syncs grader handler config from Vault KV to a K8s Secret via Vault Secrets Operator
ConfigMap — base poll settings (xqwatcher.json) and stdout-only structured logging (logging.json)
Role + RoleBinding — grants xqwatcher pods permission to create/delete Jobs and read pod logs (required by ContainerGrader's Kubernetes backend)
Deployment — runs xqueue-watcher with non-root security context, resource limits, liveness probe, and topology spread for HA
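
For readers unfamiliar with the RBAC piece, the Role in the list above could carry rules roughly like the following. This is a minimal sketch written as a plain Python dict in the shape of an `rbac.authorization.k8s.io/v1` Role manifest, not the code in `__main__.py`. The function name, role name, and the `get`/`list` verbs are assumptions added for illustration; only create/delete on Jobs and reading pod logs are stated in this description.

```python
def grader_job_role(name: str, namespace: str) -> dict:
    """Build a Role manifest allowing Job management and pod log reads."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            {
                # Grading Jobs live in the "batch" API group.
                "apiGroups": ["batch"],
                "resources": ["jobs"],
                # create/delete come from the PR text; get/list are assumed.
                "verbs": ["create", "delete", "get", "list"],
            },
            {
                # Assumed: the grader has to find the pod a Job created.
                "apiGroups": [""],
                "resources": ["pods"],
                "verbs": ["list"],
            },
            {
                # Pod logs are a subresource of pods in the core API group.
                "apiGroups": [""],
                "resources": ["pods/log"],
                "verbs": ["get"],
            },
        ],
    }


# "xqwatcher-grader-jobs" is a hypothetical role name.
role = grader_job_role("xqwatcher-grader-jobs", "mitx-openedx")
```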

Stack configs (9 files)

Removed EC2-specific keys (consul:address, auto_scale, instance_type) and added K8s-specific keys:

xqwatcher:cluster — EKS cluster name
xqwatcher:namespace — Kubernetes namespace
xqwatcher:min_replicas / max_replicas
xqwatcher:docker_tag

Deployment Prerequisites

Before applying this stack:

Build and push mitodl/xqueue-watcher image to DockerHub (from mitodl/xqueue-watcher#14)
Build and push course grader images (e.g. from MITx/graders-mit-600x#10)
Update Vault secret secret-xqwatcher/{env}-grader-config with confd_json containing a ContainerGrader handler config
Ensure Vault Secrets Operator is installed in the target cluster

Related PRs

mitodl/xqueue-watcher#14: ContainerGrader implementation + uv migration
MITx/graders-mit-600x#10: Course grader Dockerfile

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add read access to secret-DEPLOYMENT/edx-xqueue so the xqwatcher service can retrieve the xqueue server URL and authentication password needed by the ContainerGrader handler config. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Completely rewrite the xqwatcher Pulumi stack to deploy on Kubernetes instead of EC2 Auto Scaling Groups with AppArmor/codejail.

Changes:
- Replace IAM instance profile + Vault AWS auth with OLEKSAuthBinding (IRSA + Vault K8s auth backend)
- Add OLVaultK8SSecret to sync grader handler config from Vault KV to a Kubernetes Secret via the Vault Secrets Operator CRD
- Add a ConfigMap for base poll settings and structured JSON logging to stdout (no log rotation in containers)
- Add RBAC Role + RoleBinding granting the xqwatcher service account permission to create/delete Kubernetes Jobs and read pod logs, required by ContainerGrader's kubernetes backend
- Create a Kubernetes Deployment with:
  - ghcr.io/mitodl/xqueue-watcher image
  - Security context (non-root, drop ALL capabilities)
  - Resource requests + memory limit
  - Liveness probe via python -c import xqueue_watcher
  - Topology spread for HA across nodes
  - Vault grader config + base config mounted into /xqwatcher/conf.d/
- Preserve vault.kv.SecretV2 write so grader config remains managed in Pulumi
- Export k8s_deployment_name and k8s_namespace

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove EC2-specific settings (consul:address, auto_scale, instance_type) and add Kubernetes-specific settings for all stacks:
- xqwatcher:cluster - EKS cluster name (residential or applications)
- xqwatcher:namespace - target Kubernetes namespace
- xqwatcher:min_replicas - minimum pod count (maps from auto_scale.desired)
- xqwatcher:max_replicas - maximum pod count (maps from auto_scale.max)
- xqwatcher:docker_tag - container image tag (default: latest)

Cluster assignments:
- mitx, mitx-staging → residential cluster
- mitxonline → applications cluster

Namespace assignments follow xqueue convention:
- mitx → mitx-openedx
- mitxonline → mitxonline-openedx
- mitx-staging → mitx-staging-openedx

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

for more information, see https://pre-commit.ci

Copilot

Pull request overview

Migrates the xqueue-watcher (xqwatcher) infrastructure in ol-infrastructure from an EC2/ASG-based deployment to a Kubernetes Deployment on EKS, aligning with the ContainerGrader-based runtime introduced in the application repo.

Changes:

Adds xqwatcher to shared enum types to support consistent labeling.
Updates Vault policy to allow reading xqueue server credentials.
Replaces the xqwatcher EC2 stack with Kubernetes resources (Vault auth binding + VSO-synced secret + ConfigMap + RBAC + Deployment) and updates stack configs accordingly.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

File	Description
`src/ol_infrastructure/lib/ol_types.py`	Adds `xqwatcher` to `Services`/`Application` enums for consistent labels.
`src/ol_infrastructure/applications/xqwatcher/xqwatcher_server_policy.hcl`	Extends Vault policy to read xqueue server secret path.
`src/ol_infrastructure/applications/xqwatcher/__main__.py`	Full rewrite: provisions Vault+IRSA binding, VSO secret sync, ConfigMap, RBAC, and a Deployment for xqueue-watcher.
`src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.QA.yaml`	Updates stack config to K8s-focused settings (cluster/namespace/replicas/docker tag).
`src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.Production.yaml`	Same as above for Production.
`src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.CI.yaml`	Same as above for CI.
`src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.QA.yaml`	Updates config for residential mitx QA.
`src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.Production.yaml`	Updates config for residential mitx Production.
`src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.CI.yaml`	Updates config for residential mitx CI.
`src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.QA.yaml`	Updates config for mitx-staging QA.
`src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.Production.yaml`	Updates config for mitx-staging Production.
`src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.CI.yaml`	Updates config for mitx-staging CI.

src/ol_infrastructure/applications/xqwatcher/__main__.py

src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.QA.yaml

src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.QA.yaml

- Add create_irsa_service_account flag to OLEKSAuthBinding to optionally create the K8s ServiceAccount with IRSA annotation; use it in xqwatcher to fix 'serviceaccount not found' pod error
- Add XQWATCHER_* env vars to Deployment matching env_settings.py; expose http_basic_auth from Vault-synced secret via VSO template
- Fix image reference from ghcr.io to mitodl/ (DockerHub)
- Change imagePullPolicy to Always for mutable 'latest' tag
- Rename XQWATCHER_DOCKER_DIGEST to XQWATCHER_DOCKER_TAG
- Remove unused network_stack StackReference
- Remove dead xqwatcher:target_vpc config key from all 9 stacks
- Remove unimplemented xqwatcher:max_replicas from all 9 stacks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The manager CLI only accepts -d/--config_root; it auto-discovers xqwatcher.json and logging.json from that directory. Remove the non-existent --config and --logging-config flags. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

src/ol_infrastructure/components/applications/eks.py

…pods ContainerGrader calls k8s_config.load_incluster_config() which reads the service account token from the projected volume at /var/run/secrets/kubernetes.io/serviceaccount/token. The xqwatcher ServiceAccount has automount_service_account_token=False (secure default), so the PodSpec must explicitly opt in to have the token mounted, otherwise all Kubernetes Job API calls will fail with a ConfigException. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…e tag When the Concourse pipeline populates XQWATCHER_DOCKER_DIGEST, build the image ref as mitodl/xqueue-watcher@sha256:... (immutable digest) so Kubernetes always pulls exactly the image that was built and tested. Fall back to :tag from stack config only when the digest is unavailable (e.g. manual deploys). imagePullPolicy: Always is retained so new digests are always pulled on rollout. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The uv virtualenv bin directory is not on PATH in the container, so the 'xqueue-watcher' console script can't be found directly. Use 'uv run xqueue-watcher' to invoke it through uv's environment, which correctly resolves the script installed in the project virtualenv. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

uv run without --no-sync attempts to sync the virtualenv at startup, which fails in the container (no write access / network). Use --no-sync to run the already-installed entrypoint as-is. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

src/ol_infrastructure/applications/xqwatcher/__main__.py

configure_from_directory(path) reads xqwatcher.json and logging.json directly from path, then globs path/conf.d/*.json for queue watcher configs. We were passing -d /xqwatcher/conf.d and mounting everything flat there, so the manager looked for watchers at /xqwatcher/conf.d/conf.d/*.json (not found).

Fix: pass -d /xqwatcher and restructure mounts:

  /xqwatcher/xqwatcher.json            <- manager config (ConfigMap)
  /xqwatcher/logging.json              <- logging config (ConfigMap)
  /xqwatcher/conf.d/grader_config.json <- queue watchers (Vault secret)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
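
To make the path bug concrete, here is a small sketch that mimics the lookup order described in this commit message. `discover_config` is a hypothetical stand-in written for illustration; it is not the xqueue-watcher implementation of `configure_from_directory`, and the file contents are placeholders.

```python
import json
import tempfile
from pathlib import Path


def discover_config(root: Path) -> dict:
    """Resolve config files the way the commit message describes."""
    return {
        "manager": root / "xqwatcher.json",
        "logging": root / "logging.json",
        # Watcher configs are globbed from a conf.d/ directory *under* root.
        "watchers": sorted((root / "conf.d").glob("*.json")),
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "xqwatcher"
    (root / "conf.d").mkdir(parents=True)
    (root / "xqwatcher.json").write_text(json.dumps({"POLL_TIME": 10}))
    (root / "logging.json").write_text(json.dumps({"version": 1}))
    (root / "conf.d" / "grader_config.json").write_text("{}")

    # Equivalent of "-d /xqwatcher": the watcher config is found.
    good_watchers = [p.name for p in discover_config(root)["watchers"]]
    # Equivalent of "-d /xqwatcher/conf.d": the glob targets conf.d/conf.d/.
    bad_watchers = [p.name for p in discover_config(root / "conf.d")["watchers"]]
```

With the corrected root, `good_watchers` contains `grader_config.json`; with the old root, `bad_watchers` is empty because the nested `conf.d/conf.d/` directory does not exist.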

VSO renders secret values via Go templates: {{ .Secrets.confd_json }}. When confd_json is stored as a nested object, VSO renders a Go map literal (map[...]) rather than valid JSON, causing a JSONDecodeError at startup. Pre-serialize confd_json to a JSON string so the template renders parseable JSON. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
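
The pre-serialization step can be sketched as follows. The queue name, handler path, and image are placeholders, and the `str()` call is only an analogy for a language-native rendering of a nested object (it is Python's repr, not Go's `map[...]` output).

```python
import json

# Hypothetical queue config; handler path and image are placeholders.
confd = {
    "example-queue": {
        "CONNECTIONS": 1,
        "HANDLERS": [
            {
                "HANDLER": "xqueue_watcher.containergrader.ContainerGrader",
                "KWARGS": {"image": "mitodl/mit-600x-grader:latest"},
            }
        ],
    }
}

# Store a JSON *string* under confd_json so a text template emits it verbatim.
secret_data = {"confd_json": json.dumps(confd)}
rendered = secret_data["confd_json"]
round_trips = json.loads(rendered) == confd

# Analogy only: a native rendering of the nested object is not valid JSON.
try:
    json.loads(str(confd))
    native_repr_is_json = True
except json.JSONDecodeError:
    native_repr_is_json = False
```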

…llback Match the keycloak pattern: require the digest env var so the image is always pinned to an immutable digest. Remove the mutable :latest tag fallback that allowed manual pulumi-up runs to silently deploy an uncontrolled image. Also remove the unused xqwatcher:docker_tag config key from all stack YAML files. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…gh cache When the SOPS secret's confd_json contains a ContainerGrader handler whose KWARGS include an 'image' key, rewrite that value through cached_image_uri() before writing to Vault. This means the SOPS secret stores a plain DockerHub reference (e.g. mitodl/mit-600x-grader:latest) and Pulumi transforms it to the ECR pull-through cache URI at deploy time, keeping grading Jobs free from DockerHub rate limits. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
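
A rough sketch of the rewrite this commit describes is below. The `cached_image_uri` here is a stand-in with a made-up account ID and cache prefix, since the real helper's output format is not shown in this PR; the nested `HANDLERS`/`KWARGS` layout and the queue name are likewise assumptions.

```python
import copy


def cached_image_uri(image: str) -> str:
    """Stand-in for the repo helper; account ID and prefix are placeholders."""
    return f"123456789012.dkr.ecr.us-east-1.amazonaws.com/dockerhub/{image}"


def rewrite_grader_images(confd: dict) -> dict:
    """Return a copy of confd with every handler image routed via the cache."""
    result = copy.deepcopy(confd)  # leave the decrypted SOPS data untouched
    for queue in result.values():
        for handler in queue.get("HANDLERS", []):
            kwargs = handler.get("KWARGS", {})
            if "image" in kwargs:
                kwargs["image"] = cached_image_uri(kwargs["image"])
    return result


# Hypothetical queue entry using the plain DockerHub reference from the text.
source = {
    "example-queue": {
        "HANDLERS": [{"KWARGS": {"image": "mitodl/mit-600x-grader:latest"}}]
    }
}
rewritten = rewrite_grader_images(source)
```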

The CodeQL 'Analyze (actions)' job (exit code 32) fails because the extractor finds .github/workflows/*.yml and .github/actions/**/*.yml but cannot process any of them. This is a known extractor-level issue with CodeQL 2.24.x on Erk agent workflow patterns. Excluding .github from CodeQL's path analysis silences the fatal error while leaving Python and JavaScript/TypeScript scans unaffected. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add src/ol_concourse/pipelines/open_edx/grader_images/ with three pipeline definitions for building and publishing containerized course grader images to private ECR.

base_image_pipeline.py: Builds grader_support/Dockerfile.base from the xqueue-watcher repo and pushes to both DockerHub (mitodl/xqueue-watcher-grader-base, public) and ECR (610119931565.dkr.ecr.us-east-1.amazonaws.com/mitodl/xqueue-watcher-grader-base, private). Triggered by changes to grader_support/ in the xqueue-watcher repo. The ECR push is the trigger source for downstream per-grader build pipelines.

build_pipeline.py: GraderPipelineConfig dataclass and grader_image_pipeline() factory for per-grader-repo build pipelines. Triggered by new commits to the grader repo OR a new base image digest in ECR. The Docker build receives GRADER_BASE_IMAGE=repo@sha256:... resolved at runtime via a sh wrapper around oci-build-task's build script (the only way to inject a file-derived BUILD_ARG in Concourse; params are static strings). Pushes to private ECR only. GRADER_PIPELINES list seeded with graders-mit-600x.

meta.py: Self-updating meta pipeline that creates and maintains the base image pipeline and one build pipeline per GRADER_PIPELINES entry. Triggered by changes to the grader_images/ pipeline code in ol-infrastructure.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ines

- base_image_pipeline: use chore/migrate-to-uv-and-k8s-container-grader branch of xqueue-watcher (where Dockerfile.base updates live)
- build_pipeline: track feat/containerized-grader for graders-mit-600x
- Fix E501 in both files: split long strings to stay within 88-char limit

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The CONTEXT was grader_support/ which caused the COPY grader_support/ instruction in Dockerfile.base to fail (no nested grader_support/ inside the context). Use the repo root as CONTEXT so the COPY can locate the directory relative to it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…images

Add ensure_ecr_task() helper to ol_concourse/lib/containers.py (mirrors the pattern used in the dagster docker_pulumi_pipeline). The task runs the AWS CLI to check for the ECR repository and creates it if missing, so the first pipeline run does not fail on a missing registry.

Apply to both grader image pipelines:
- base_image_pipeline: ensures mitodl/xqueue-watcher-grader-base exists before pushing to ECR
- build_pipeline: ensures the per-grader ECR repo (config.ecr_repo_name) exists before pushing the course grader image

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
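
The check-then-create step could look something like the sketch below. It builds a Concourse task config as a plain dict rather than using the `ol_concourse` models, so the function names, the `amazon/aws-cli` image choice, and the overall shape are assumptions; only the two AWS CLI subcommands and the `AWS_DEFAULT_REGION` param (added in a later commit) are taken as given.

```python
import shlex


def ensure_ecr_command(repo_name: str) -> str:
    """Shell one-liner: create the ECR repository only if describing it fails."""
    quoted = shlex.quote(repo_name)
    return (
        f"aws ecr describe-repositories --repository-names {quoted} "
        f"|| aws ecr create-repository --repository-name {quoted}"
    )


def ensure_ecr_task_config(repo_name: str, region: str = "us-east-1") -> dict:
    """Hypothetical Concourse task config wrapping the command above."""
    return {
        "platform": "linux",
        "image_resource": {
            "type": "registry-image",
            "source": {"repository": "amazon/aws-cli"},  # assumed image
        },
        "params": {"AWS_DEFAULT_REGION": region},
        "run": {"path": "sh", "args": ["-c", ensure_ecr_command(repo_name)]},
    }


task = ensure_ecr_task_config("mitodl/xqueue-watcher-grader-base")
```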

When ecr_region is set, the registry-image resource automatically constructs the full ECR URI as {account}.dkr.ecr.{region}.amazonaws.com/{repository}. Passing the full URI in image_repository caused the hostname to be doubled in API calls, resulting in NAME_UNKNOWN errors.

- Remove ecr_image_uri property from GraderPipelineConfig
- Fix grader_base_ecr_repo default to use repo-name-only string
- Change registry_image(image_repository=config.ecr_image_uri) to registry_image(image_repository=config.ecr_repo_name)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
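
The doubling is easy to reproduce with a toy model of the URI construction quoted in this commit. `registry_image_uri` is a hypothetical helper that only mirrors the `{account}.dkr.ecr.{region}.amazonaws.com/{repository}` pattern; it is not the resource's actual code.

```python
def registry_image_uri(account_id: str, region: str, repository: str) -> str:
    """Model how the full ECR URI is derived when ecr_region is set."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}"


ACCOUNT = "610119931565"
REGION = "us-east-1"
REPO = "mitodl/xqueue-watcher-grader-base"

# Correct: pass only the repository name.
good = registry_image_uri(ACCOUNT, REGION, REPO)

# Bug: passing an already-complete URI prefixes the registry host a second time.
doubled = registry_image_uri(ACCOUNT, REGION, good)
```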

The grader-images-pipeline-code git resource was tracking 'main', but the pipeline files don't exist on main yet. Switch to the feature branch until this work is merged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

src/ol_concourse/lib/containers.py

The graders-mit-600x repository is private. Switch the git resource from an HTTPS git_repo to an ssh_git_repo so Concourse can clone it. The SSH private key is read from Vault at ((github.ssh_private_key)).

- Import ssh_git_repo instead of git_repo
- Add github_private_key field to GraderPipelineConfig (defaults to ((github.ssh_private_key)))
- Update grader_repo_url in GRADER_PIPELINES to use SSH form (git@github.com:mitodl/graders-mit-600x)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

infrastructure/github has no generic SSH key. The correct key for cloning private mitodl repos from the infrastructure Concourse team is odlbot_private_ssh_key in infrastructure/open_api_clients. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…onfig + SERVER_REF

Queue configs (CONNECTIONS, HANDLERS, ContainerGrader KWARGS) are now stored as plaintext in Pulumi stack YAML files under xqwatcher:queues. The xqueue server URL is stored under xqwatcher:xqueue_server_url. SERVER_REF is injected at deploy time so xqueue-watcher resolves credentials at runtime from xqueue_servers.json, which is mounted from a Vault-synced Kubernetes Secret. The secret is sourced from the same secret-{env_prefix}/edx-xqueue Vault KV path already used by the xqueue and edxapp deployments (xqwatcher_password field), eliminating the separate xqwatcher-specific KV mount and SOPS secrets files.

Changes:
- __main__.py: remove SOPS read, vault.kv.SecretV2, vault_mount_stack StackReference, and XQWATCHER_HTTP_BASIC_AUTH env var; read queues config from Pulumi config; inject SERVER_REF into each queue entry; move grader_config.json into ConfigMap; add xqueue_servers.json Vault-synced secret from secret-{env_prefix}/edx-xqueue; update Deployment volumes/mounts accordingly
- xqwatcher_server_policy.hcl: remove secret-xqwatcher/* path
- All 9 stack YAML files: add xqueue_server_url and queues config

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

src/ol_infrastructure/applications/xqwatcher/__main__.py

- Add AWS_DEFAULT_REGION=us-east-1 to ensure_ecr_task params so the AWS CLI knows which region to use without relying on worker defaults
- Remove spurious service_account_name kwarg from OLVaultK8SResourcesConfig instantiation in OLEKSAuthBinding; the field does not exist on the model and the name is derived internally from application_name
- Fix liveness probe to use 'uv run --no-sync python' instead of bare 'python', which would fail with ModuleNotFoundError because xqueue_watcher is only available inside the uv virtual environment

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
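
Taken together with the earlier `uv run --no-sync` and `-d /xqwatcher` commits, the container entrypoint and probe would look roughly like this. The dict mirrors Kubernetes container-spec field names; the `UV_PREFIX` name and the omission of probe timing fields are choices made for this sketch, not details from the PR.

```python
# Shared prefix so both commands run inside the pre-built uv environment
# without attempting a sync at startup.
UV_PREFIX = ["uv", "run", "--no-sync"]

container = {
    # The manager CLI takes the config root via -d.
    "command": [*UV_PREFIX, "xqueue-watcher", "-d", "/xqwatcher"],
    "livenessProbe": {
        "exec": {
            # A bare "python" would not see the virtualenv's packages.
            "command": [*UV_PREFIX, "python", "-c", "import xqueue_watcher"],
        }
    },
}
```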

Copilot

Pull request overview

Migrates the xqueue-watcher infrastructure from an EC2/ASG deployment to Kubernetes, updating Vault access and stack configuration, and adding Concourse pipelines to build/publish grader container images used by the new ContainerGrader flow.

Changes:

Add xqwatcher to shared enums used for labeling.
Replace the xqwatcher stack’s EC2 resources with Kubernetes resources (Deployment, RBAC, ConfigMap, Vault Secrets Operator integration).
Add Concourse pipelines to build a grader base image and course-specific grader images, and update stack YAML configs for the new K8s-based deployment.

Reviewed changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 7 comments.

File	Description
src/ol_infrastructure/lib/ol_types.py	Adds `xqwatcher` to enums used for consistent label generation.
src/ol_infrastructure/components/applications/eks.py	Extends `OLEKSAuthBinding` to optionally create IRSA ServiceAccount(s).
src/ol_infrastructure/applications/xqwatcher/xqwatcher_server_policy.hcl	Adjusts Vault policy to allow reading xqueue credentials from the shared secret path.
src/ol_infrastructure/applications/xqwatcher/__main__.py	Replaces EC2-based deployment with K8s Deployment + RBAC + ConfigMap + VSO-managed secrets.
src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.QA.yaml	Updates stack config from EC2 params to K8s params + queue definitions.
src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.Production.yaml	Updates stack config from EC2 params to K8s params + queue definitions.
src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.CI.yaml	Updates stack config from EC2 params to K8s params + queue definitions.
src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.QA.yaml	Updates stack config from EC2 params to K8s params + queue definitions.
src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.Production.yaml	Updates stack config from EC2 params to K8s params + queue definitions.
src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.CI.yaml	Updates stack config from EC2 params to K8s params + queue definitions.
src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.QA.yaml	Updates stack config from EC2 params to K8s params + queue definitions.
src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.Production.yaml	Updates stack config from EC2 params to K8s params + queue definitions.
src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx-staging.CI.yaml	Updates stack config from EC2 params to K8s params + queue definitions.
src/ol_concourse/pipelines/open_edx/grader_images/meta.py	Adds a self-updating meta pipeline that creates/updates grader image pipelines.
src/ol_concourse/pipelines/open_edx/grader_images/build_pipeline.py	Adds reusable pipeline generator for course-specific grader images.
src/ol_concourse/pipelines/open_edx/grader_images/base_image_pipeline.py	Adds pipeline generator for building/publishing the shared grader base image.
src/ol_concourse/pipelines/open_edx/grader_images/__init__.py	Initializes the new grader_images pipeline package.
src/ol_concourse/lib/containers.py	Adds a reusable task step to ensure an ECR repository exists before pushing.
src/bridge/secrets/xqwatcher/secrets.mitx.ci.yaml	Updates encrypted xqwatcher grading configuration secrets for the new backend.
.github/codeql/codeql-config.yml	Adds CodeQL config to exclude `.github` from actions extraction failures.

src/ol_concourse/pipelines/open_edx/grader_images/meta.py

src/ol_concourse/pipelines/open_edx/grader_images/base_image_pipeline.py

src/ol_concourse/pipelines/open_edx/grader_images/build_pipeline.py

src/ol_infrastructure/applications/xqwatcher/__main__.py

Replace the old Packer-based xqwatcher pipeline with a Docker+Pulumi pipeline that mirrors the xqueue pattern:
- Watches mitodl/xqueue-watcher (main) for new commits
- Builds and pushes the Docker image to DockerHub as mitodl/xqueue-watcher:{release}
- Passes the built image digest as XQWATCHER_DOCKER_DIGEST to each Pulumi stack so the Deployment rolls to the exact image SHA

Update meta.py to generate docker-pulumi-xqwatcher-{release} pipelines instead of the retired packer-pulumi ones.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Unpin grader_images meta pipeline from feature branch; track main
- Unpin xqueue-watcher base image source from dev branch; track main
- Unpin graders-mit-600x grader repo from feature branch; track main
- Fix base_image_pipeline.py docstring: downstream pipelines trigger off the DockerHub push, not the ECR push
- Add xqwatcher:docker_tag config fallback for XQWATCHER_DOCKER_DIGEST so pulumi up can run without the env var set (matches xqueue pattern)
- Remove env vars that duplicate xqwatcher.json ConfigMap values (POLL_TIME, REQUESTS_TIMEOUT, POLL_INTERVAL, FOLLOW_CLIENT_REDIRECTS); keep only LOGIN_POLL_INTERVAL and GRADER_* which are not in the ConfigMap
- Update PR description: image is on DockerHub, not GHCR

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

src/ol_infrastructure/applications/xqwatcher/__main__.py

Register the MIT 6.686x course-specific grader image in GRADER_PIPELINES so the meta pipeline creates a build-graders-mit-686x-image Concourse pipeline that tracks the graders-mit-686x repo and pushes to ECR at mitodl/graders-mit-686x. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

src/ol_infrastructure/applications/xqwatcher/__main__.py

Add the edxorg-686x queue to the mitxonline production xqwatcher stack using the ContainerGrader handler, replacing the legacy JailedGrader configuration in confd_json. This is in preparation for deployment of the xqueue-watcher changes in mitodl/xqueue-watcher#14. The memory limit is set to 1Gi (vs 512Mi for 600x) to accommodate the torch dependency used by the mnist problem set graders. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add an "edxorg" entry to the xqueue_servers.json Vault template so that queues using SERVER_REF "edxorg" resolve credentials for https://xqueue.edx.org. The template variables edxorg_xqueue_username and edxorg_xqueue_password must be added to the existing edx-xqueue Vault KV secret.

Update the queue config loop to use setdefault so that queues can declare their own SERVER_REF in the Pulumi stack config rather than always being assigned "default".

Set SERVER_REF: edxorg on the edxorg-686x queue in the mitxonline production stack config.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
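
The `setdefault` change amounts to something like the following sketch. The `example-600x` queue name and the `CONNECTIONS` values are placeholders; `edxorg-686x` and the `"default"` / `"edxorg"` refs come from this commit.

```python
# Queue config as it might be read from the Pulumi stack config.
queues = {
    "example-600x": {"CONNECTIONS": 2},  # no SERVER_REF declared
    "edxorg-686x": {"CONNECTIONS": 1, "SERVER_REF": "edxorg"},
}

for queue_config in queues.values():
    # Only fill in a SERVER_REF when the stack config did not set one;
    # plain assignment would overwrite "edxorg" with "default".
    queue_config.setdefault("SERVER_REF", "default")
```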

…ofile

Accommodates changes from xqueue-watcher commit 15fdd86 (security: harden containergrader and XQueue client):
- Expose XQWATCHER_VERIFY_TLS via xqwatcher:verify_tls Pulumi config (default "true"; set "false" only for dev envs with self-signed certs).
- Expose XQWATCHER_SUBMISSION_SIZE_LIMIT via xqwatcher:submission_size_limit Pulumi config (default 1 MB, matching containergrader default).
- Add RuntimeDefault seccomp profile to the xqwatcher pod's PodSecurityContextArgs, mirroring the profile now applied to grading Jobs in containergrader.py for defence-in-depth consistency.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

blarghmatey · 2026-03-23T17:14:06Z

/gemini review

Copilot

Pull request overview

Copilot reviewed 20 out of 21 changed files in this pull request and generated 16 comments.

src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.Production.yaml

src/ol_concourse/pipelines/open_edx/grader_images/base_image_pipeline.py

src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.QA.yaml

...frastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.Production.yaml

src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.CI.yaml

src/ol_infrastructure/applications/xqwatcher/__main__.py

src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitxonline.CI.yaml

src/ol_infrastructure/applications/xqwatcher/__main__.py

src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.QA.yaml

gemini-code-assist

Code Review

This pull request represents a significant refactor, migrating the xqueue-watcher service from an EC2-based deployment to a Kubernetes-based deployment on EKS. Key changes include updating Pulumi configurations to define Kubernetes-specific settings for xqwatcher queues and container grader parameters, and a complete rewrite of the main Pulumi application file to provision Kubernetes Deployments, ConfigMaps, Secrets (via Vault K8s Secrets Operator), and RBAC roles. New Concourse pipelines have been introduced to build and publish xqueue-watcher base and course-specific grader images to DockerHub and ECR, along with a meta-pipeline to manage them. Additionally, a utility function ensure_ecr_task was added for creating ECR repositories, and Vault policies and EKS authentication bindings were updated to support the new Kubernetes deployment model. The review comments highlight the need to use immutable tags for grader images in production environments instead of :latest to ensure predictable deployments, and suggest refactoring the duplicated queues configuration across multiple Pulumi stack files for improved maintainability.

src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.Production.yaml

src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.CI.yaml

- Use ':' separator for tags and '@' only for sha256 digests when building the xqwatcher docker_image_ref; rename env var from XQWATCHER_DOCKER_DIGEST to XQWATCHER_DOCKER_TAG to match the config key name
- Fix misleading comment on ECR base image resource in base_image_pipeline.py: downstream grader pipelines trigger off DockerHub, not ECR
- Remove automount_service_account_token=False from IRSA ServiceAccount created by OLEKSAuthBinding so the projected token is mounted and IRSA can authenticate via sts:AssumeRoleWithWebIdentity

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
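
The separator rule in the first bullet can be sketched as below. The function name and the example digest are hypothetical; only the `:` versus `@` rule and the `mitodl/xqueue-watcher` repository come from this PR.

```python
def docker_image_ref(repository: str, tag_or_digest: str) -> str:
    """Join repository and reference with '@' for digests, ':' for tags."""
    separator = "@" if tag_or_digest.startswith("sha256:") else ":"
    return f"{repository}{separator}{tag_or_digest}"


by_tag = docker_image_ref("mitodl/xqueue-watcher", "latest")
# Placeholder digest for illustration, not a real image digest.
by_digest = docker_image_ref("mitodl/xqueue-watcher", "sha256:" + "0" * 64)
```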

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Add a HorizontalPodAutoscaler (autoscaling/v2) targeting the xqwatcher Deployment, scaling on:
- CPU: 60% average utilization
- Memory: 80% average utilization

Scale-up is aggressive (up to 100% more pods per minute, 60s stabilization) to handle submission bursts; scale-down is conservative (≤25% reduction per minute, 5-minute stabilization) to avoid thrashing.

Min/max replica bounds are configurable via xqwatcher:min_replicas and xqwatcher:max_replicas stack config (defaults: 1 and 5).

The Deployment gains ignore_changes=["spec.replicas"] so Pulumi does not revert the replica count that the HPA manages between stack updates. Exports k8s_hpa_name for stack consumers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
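
For reference, the numbers in this commit message map onto an `autoscaling/v2` HPA spec along these lines. This is a sketch as a plain dict using the Kubernetes manifest field names, not the Pulumi code in the stack; the function name is hypothetical.

```python
def xqwatcher_hpa_spec(
    deployment_name: str, min_replicas: int = 1, max_replicas: int = 5
) -> dict:
    """Build an autoscaling/v2 HPA spec with the thresholds described above."""

    def utilization(resource: str, percent: int) -> dict:
        return {
            "type": "Resource",
            "resource": {
                "name": resource,
                "target": {"type": "Utilization", "averageUtilization": percent},
            },
        }

    return {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": deployment_name,
        },
        "minReplicas": min_replicas,
        "maxReplicas": max_replicas,
        "metrics": [utilization("cpu", 60), utilization("memory", 80)],
        "behavior": {
            # Up to 100% more pods per 60s, after a 60s stabilization window.
            "scaleUp": {
                "stabilizationWindowSeconds": 60,
                "policies": [{"type": "Percent", "value": 100, "periodSeconds": 60}],
            },
            # At most 25% fewer pods per 60s, after a 5-minute window.
            "scaleDown": {
                "stabilizationWindowSeconds": 300,
                "policies": [{"type": "Percent", "value": 25, "periodSeconds": 60}],
            },
        },
    }


hpa_spec = xqwatcher_hpa_spec("xqwatcher")
```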

The original design embedded edxorg credentials in the same VaultStaticSecret as the MIT-hosted xqueue server, referencing edxorg_xqueue_username / edxorg_xqueue_password keys that do not exist in secret-<env>/edx-xqueue (which only holds edxapp_password and xqwatcher_password).

Instead, create a fully independent Deployment per xqueue server:
- xqwatcher (default): watches queues targeting the MIT-hosted xqueue. Reads credentials from secret-<env>/edx-xqueue. Only queues with SERVER_REF="default" (or no SERVER_REF) are included in its ConfigMap.
- xqwatcher-edxorg (optional): watches queues with SERVER_REF="edxorg". Reads credentials from a separate secret-<env>/edxorg-xqueue Vault path, created only when xqwatcher:edxorg_xqueue_enabled=true.

Each Deployment has its own ConfigMap (scoped to its own queue subset), VaultStaticSecret, and HPA; they share the xqwatcher ServiceAccount and RBAC Role since both need identical permissions to manage grading Jobs. The Vault policy gains a read grant for secret-DEPLOYMENT/edxorg-xqueue.

This eliminates the need for a file-merge initContainer and gives each server integration independent observability, scaling, and secret access.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
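
The per-server scoping of each ConfigMap could be sketched as a simple partition over the queue config. `split_queues_by_server` and the `example-600x` queue are hypothetical; the treatment of a missing SERVER_REF as "default" follows the commit message.

```python
from collections import defaultdict


def split_queues_by_server(queues: dict) -> dict:
    """Group queue configs by SERVER_REF, treating a missing ref as 'default'."""
    grouped: dict = defaultdict(dict)
    for name, config in queues.items():
        grouped[config.get("SERVER_REF", "default")][name] = config
    return dict(grouped)


queues = {
    "example-600x": {"CONNECTIONS": 2},
    "edxorg-686x": {"CONNECTIONS": 1, "SERVER_REF": "edxorg"},
}
by_server = split_queues_by_server(queues)

# Each Deployment's ConfigMap would receive only its own subset.
default_queues = by_server.get("default", {})
edxorg_queues = by_server.get("edxorg", {})
```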

src/ol_concourse/pipelines/open_edx/xqwatcher/docker_pulumi_pipeline.py

blarghmatey and others added 5 commits March 11, 2026 12:22

feat(ol_types): add xqwatcher to Services and Application enums

de52643

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

5613460

for more information, see https://pre-commit.ci

blarghmatey requested a review from Copilot March 18, 2026 18:50

Copilot AI reviewed Mar 18, 2026

blarghmatey and others added 3 commits March 18, 2026 15:35

chore: Use xqwatcher image from dockerhub pull-through cache

3b62704

sentry bot reviewed Mar 18, 2026

src/ol_infrastructure/components/applications/eks.py

blarghmatey and others added 4 commits March 18, 2026 16:44

sentry bot reviewed Mar 18, 2026

src/ol_infrastructure/applications/xqwatcher/__main__.py

blarghmatey and others added 11 commits March 18, 2026 17:22

sentry bot reviewed Mar 19, 2026

src/ol_concourse/lib/containers.py

blarghmatey and others added 2 commits March 19, 2026 15:48

blarghmatey and others added 2 commits March 20, 2026 11:28

fix: Update mitx CI watcher password to match xqueue

a4fc006

sentry bot reviewed Mar 20, 2026

src/ol_infrastructure/applications/xqwatcher/__main__.py (outdated)

blarghmatey requested a review from Copilot March 20, 2026 17:00

Copilot started reviewing on behalf of blarghmatey March 20, 2026 17:01

Copilot AI reviewed Mar 20, 2026

blarghmatey and others added 2 commits March 20, 2026 13:22

sentry bot reviewed Mar 20, 2026

src/ol_infrastructure/applications/xqwatcher/__main__.py

blarghmatey and others added 2 commits March 20, 2026 14:06

Delete .github/codeql/codeql-config.yml

d1f4d77

sentry bot reviewed Mar 20, 2026

src/ol_infrastructure/applications/xqwatcher/__main__.py (outdated)

blarghmatey and others added 3 commits March 20, 2026 16:04

blarghmatey requested a review from Copilot March 23, 2026 17:13

Copilot started reviewing on behalf of blarghmatey March 23, 2026 17:14

Copilot AI reviewed Mar 23, 2026

gemini-code-assist bot reviewed Mar 23, 2026

src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.Production.yaml

src/ol_infrastructure/applications/xqwatcher/Pulumi.applications.xqwatcher.mitx.CI.yaml

blarghmatey and others added 2 commits March 23, 2026 13:51

Apply suggestion from @Copilot

a9df683

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

feoh approved these changes Mar 24, 2026

blarghmatey and others added 3 commits March 24, 2026 12:28

chore: Update watcher pipeline

f5ffad6

sentry bot reviewed Mar 24, 2026

src/ol_concourse/pipelines/open_edx/xqwatcher/docker_pulumi_pipeline.py

blarghmatey merged commit caf670a into main Mar 24, 2026
7 of 8 checks passed

blarghmatey deleted the feat/xqwatcher-kubernetes-migration branch March 24, 2026 18:15

Conversation

blarghmatey commented Mar 23, 2026

3 participants

blarghmatey commented Mar 11, 2026 (edited)

`src/ol_infrastructure/lib/ol_types.py`

`src/ol_infrastructure/applications/xqwatcher/xqwatcher_server_policy.hcl`

`src/ol_infrastructure/applications/xqwatcher/main.py`